
    Estimation and Modelling of Errors in the Library Preparation Stage of Next Generation Sequencing

    Next-generation sequencing has empowered genomics by making it possible to sequence genomes at lower cost and in less time than the traditional Sanger method. However, these improvements come with reduced accuracy relative to the Sanger method. During the library preparation stage of sequencing, artefacts can be introduced that affect the reliability of a read. These artefacts can arise from biases due to the structure of the genome, such as preferential splitting of DNA between specific nucleotides, bias of adapter ligation towards certain base pair identities, and temperature-dependent denaturation due to nucleotide composition. To investigate these issues, a library preparation model was developed to simulate the occurrence of such artefacts and to investigate their effects. The implemented model simulates the DNA fragmentation, adapter ligation and PCR amplification stages of the library preparation process. Its inputs are a DNA sequence and a set of parameters characterizing these steps; its output is an array of values representing the number of DNA fragments that cover each position of the input sequence (“coverage”). To validate the model, a Genetic Algorithm (GA) was used to find parameters that lead to coverage values closely resembling those found in empirical sequencing data. The GA was able to find such parameters for a subsection of the Mycobacterium tuberculosis and Plasmodium falciparum genomes but failed when applied to the TP53 gene of the Homo sapiens genome. From this it was deduced that the model was better at predicting coverage when applied to genomes with subregions of nucleotide repeats. To find the effects of the parameters representing each step of the library preparation process, the model was applied to a set of in silico generated DNA sequences that represent different sequence structures (GC-rich, AT-rich, neutral composition, and a sequence with specific areas of GC-rich and AT-rich repeats).
My study found that the parameters for the fragmentation, adapter ligation and PCR steps affected coverage. I also found that combinations of parameters between consecutive steps further affected coverage. In the fragmentation step, large fragment size had a negative effect on coverage (p = 0.0); in the adapter ligation step, coverage of AT-rich sequences was affected by a terminal bias (p = 0.0). Modifying parameters for the PCR step affected the coverage of both GC-rich and AT-rich sequences due to a temperature-dependent bias. Finally, an interaction between the parameters of fragmentation and those of other steps was found to further reduce coverage. The simulation was thus able to suggest parameters that need to be fine-tuned to improve coverage.
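To make the notion of "coverage" concrete, the following is a minimal sketch of only the fragmentation and counting part of such a simulation. It is not the model from the thesis: the exponential fragment-length distribution, the size-selection window, and all function and parameter names (`fragment`, `coverage`, `simulate_coverage`, `mean_size`, `min_size`, `max_size`) are assumptions made for illustration, and the adapter ligation and PCR steps are omitted.

```python
import random


def fragment(seq_len, mean_size, rng):
    """Cut the interval [0, seq_len) at random breakpoints.

    Assumption: fragment lengths are drawn from an exponential
    distribution with the given mean (a hypothetical choice).
    Returns half-open (start, end) fragments that tile the sequence.
    """
    cuts = [0]
    pos = 0
    while True:
        pos += max(1, round(rng.expovariate(1.0 / mean_size)))
        if pos >= seq_len:
            break
        cuts.append(pos)
    cuts.append(seq_len)
    return list(zip(cuts[:-1], cuts[1:]))


def coverage(seq_len, fragments):
    """Count how many fragments cover each position (difference array)."""
    diff = [0] * (seq_len + 1)
    for start, end in fragments:
        diff[start] += 1
        diff[end] -= 1
    cov, running = [], 0
    for delta in diff[:seq_len]:
        running += delta
        cov.append(running)
    return cov


def simulate_coverage(seq_len, n_copies, mean_size, min_size, max_size, seed=0):
    """Fragment n_copies of the sequence and keep fragments within the
    size window [min_size, max_size] (hypothetical size selection)."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_copies):
        for start, end in fragment(seq_len, mean_size, rng):
            if min_size <= end - start <= max_size:
                kept.append((start, end))
    return coverage(seq_len, kept)
```

With a size window that accepts every fragment, each copy of the sequence contributes exactly one fragment per position, so the coverage is flat; narrowing the window removes fragments and produces the kind of uneven coverage the abstract describes.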

    Proceedings of Abstracts Engineering and Computer Science Research Conference 2019

    © 2019 The Author(s). This is an open-access work distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. For further details please see https://creativecommons.org/licenses/by/4.0/. Note: Keynote: Fluorescence visualisation to evaluate effectiveness of personal protective equipment for infection control is © 2019 Crown copyright and so is licensed under the Open Government Licence v3.0. Under this licence users are permitted to copy, publish, distribute and transmit the Information; adapt the Information; exploit the Information commercially and non-commercially, for example, by combining it with other Information, or by including it in your own product or application. Where you do any of the above you must acknowledge the source of the Information in your product or application by including or linking to any attribution statement specified by the Information Provider(s) and, where possible, provide a link to this licence: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
    This book is the record of abstracts submitted and accepted for presentation at the Inaugural Engineering and Computer Science Research Conference held 17th April 2019 at the University of Hertfordshire, Hatfield, UK. This conference is a local event aiming at bringing together the research students, staff and eminent external guests to celebrate Engineering and Computer Science Research at the University of Hertfordshire. The ECS Research Conference aims to showcase the broad landscape of research taking place in the School of Engineering and Computer Science. The 2019 conference was articulated around three topical cross-disciplinary themes: Make and Preserve the Future; Connect the People and Cities; and Protect and Care.

    "Features of mapping a single chromosome by different aligners, revealed by a novel quality assessment method."

    Data used in publication:
    ABED.zip: directory containing model and all aligners sequences in BED format
    BAM.zip: directory containing model and all aligners sequences in BAM format, case3 (mapping of reads from a whole genome to a reference genome)
    BAM.tar.gz: the same as BAM.zip
    ReadMe.txt: this text + file naming convention
    ProcessedDataAlign.zip: 4 tables containing the results of calculations and graphs

    Novel read density distribution score shows possible aligner artefacts, when mapping a single chromosome

    Background: The use of artificial data to evaluate the performance of aligners and peak callers not only improves the accuracy and reliability of the evaluation, but also makes it possible to reduce the computational time. One natural way to achieve such a time reduction is to map a single chromosome. Results: We investigated whether single-chromosome mapping causes any artefacts in the aligners’ performance. We compared the accuracy of seven aligners on well-controlled simulated benchmark data sampled from a single chromosome and from a whole genome. We found that commonly used statistical methods are insufficient to evaluate an aligner’s performance, and applied a novel measure of read density distribution similarity, which allowed us to reveal artefacts in the aligners’ performance. We also calculated mismatch statistics and constructed mismatch frequency distributions along the read. Conclusions: Generating artificial data by mapping reads sampled from a single chromosome to a reference chromosome is justified from the point of view of reducing the benchmarking time. The proposed quality assessment method makes it possible to identify inherent shortcomings of aligners that are not detected by conventional statistical methods and that can affect the quality of alignment of real data.
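As an illustration of what a mismatch frequency distribution along the read could look like in code, here is a minimal sketch. It is not the authors' method: the function name `mismatch_frequency` and the input format (pairs of a read and the reference segment it was aligned to, equal in length and gap-free) are assumptions made for this example. The novel read density similarity measure is not defined in the abstract and is not reproduced here.

```python
def mismatch_frequency(pairs):
    """Fraction of mismatching bases at each position along the read.

    Assumption: `pairs` is an iterable of (read, reference) strings,
    where each read is aligned to its reference segment without gaps,
    so both strings have the same length. Reads may differ in length;
    each position is normalised by the number of reads that reach it.
    """
    mismatches = []  # mismatch count per read position
    totals = []      # number of reads covering each read position
    for read, ref in pairs:
        if len(read) != len(ref):
            raise ValueError("read and reference segment must have equal length")
        for i, (read_base, ref_base) in enumerate(zip(read, ref)):
            if i == len(totals):
                totals.append(0)
                mismatches.append(0)
            totals[i] += 1
            if read_base != ref_base:
                mismatches[i] += 1
    return [m / t for m, t in zip(mismatches, totals)]
```

For example, `mismatch_frequency([("ACGT", "ACGA"), ("ACGT", "TCGT")])` returns `[0.5, 0.0, 0.0, 0.5]`: one of the two reads mismatches at the first position and one at the last.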